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We present a new solver for coupled nonlinear elliptic partial differential equations 
(PDEs). The solver is based on pseudo-spectral collocation with domain decomposi- 
tion and can handle one- to three-dimensional problems. It has three distinct features. 
First, the combined problem of solving the PDE, satisfying the boundary conditions, and 
matching between different subdomains is cast into one set of equations readily accessible 
to standard linear and nonlinear solvers. Second, touching as well as overlapping subdo- 
mains are supported; both rectangular blocks with Chebyshev basis functions as well as 
spherical shells with an expansion in spherical harmonics are implemented. Third, the 
code is very flexible: The domain decomposition as well as the distribution of collocation 
points in each domain can be chosen at run time, and the solver is easily adaptable to 
new PDEs. The code has been used to solve the equations of the initial value problem 
of general relativity and should be useful in many other problems. We compare the new 
method to finite difference codes and find it superior in both runtime and accuracy, at 
least for the smooth problems considered here. 
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1. INTRODUCTION 



Elliptic partial differential equations (PDE) are a basic and important aspect of 
almost all areas of natural science. Numerical solutions of PDEs in three or more 
dimensions pose a formidable problem, requiring a significant amount of memory 
and CPU time. Off-the-shelf solvers are available; however, it can be difficult to 
adapt such a solver to a particular problem at hand, especially when the compu- 
tational domain of the PDE is nontrivial, or when one deals with a set of coupled 
PDEs. 

There are three basic approaches to solving PDEs: Finite differences, finite ele- 
ments and spectral methods. Finite differences are easiest to code. However, they 
converge only algebraically and therefore need a large number of grid points and 
have correspondingly large memory requirements. Finite elements and spectral 
methods both expand the solution in basis functions. Finite elements use many 



1 



subdomains and expand to low order in each subdomain, whereas spectral methods 
use comparatively few subdomains with high expansion orders. Finite elements 
are particularly well suited to irregular geometries appearing in many engineer- 
ing applications. For sufficiently regular domains, however, spectral methods are 
generally faster and/or more accurate. 

Multidomain spectral methods date back at least to the work of Qrszag|2^|. 
In a multidomain spectral method, one has to match the solution across different 
subdomains. Often this is accomplished by a combination of solves on individual 
subdomains together with a global scheme to find the function values at the in- 
ternal subdomain boundaries. Examples of such global schemes are relaxational 
iteratio n|l3| , an influence matrix fl9|, or the spectral projection decomposition 
method[|14|. For simple PDEs like the Helmholtz equation, fast solvers for the sub- 
domain solves are available. For more complicated PDEs, or for coupled PDEs, 
the subdomain solves will typically use an iterative solver. One drawback of these 
schemes is that information from the iterative subdomain solves is not used in the 
global matching procedure until the subdomain solves have completely converged. 
The question arises whether efficiency can be improved by avoiding this delay in 
communication with the matching procedure. 

In this paper we present a new spectral method for coupled nonlinear PDEs 
based on pseudospectral collocation with domain decomposition. This method 
does not split subdomain solves and matching into two distinct elements. Instead 
it combines satisfying the PDE on each subdomain, matching between subdomains, 
and satisfying the boundary conditions into one set of equations. This system of 
equations is then solved with an iterative solver, typically GMRESQ. At each iter- 
ation, this solver thus has up-to-date information about residuals on the individual 
subdomains and about matching and thus can make optimal use of all information. 

The individual subdomains implemented are rectangular blocks and spherical 
shells. Whereas either rectangular blocks (see e.g. [H| or spherical shells ||l6f 
have been employed before, we are not aware of work using both. The code supports 
an arbitrary number of blocks and shells that can touch each other a nd / or overlap. 



Moreover, the operator S at the core of the method (see section turns out 
to be modular, i.e. the code fragments used to evaluate the PDE, the boundary 
conditions, and the matching conditions are independent of each other. Thus the 
structure of the resulting code allows for great flexibility, which is further enhanced 
by a novel point of view of the mappings that are used to map collocation coor- 
dinates to the physical coordinates. This flexibility is highlighted in the following 
aspects: 

• The user code for the particular PDE at hand is completely independent from 
the code dealing with the spectral method and domain decomposition. For a 
new PDE, the user has to supply only the code that computes the residual 
and its linearization. 

• Mappings are employed to control how collocation points and thus resolution 
are distributed within each subdomain. New mappings can be easily added 
which are then available for all PDEs that have been coded. 

• The solver uses standard software packages for the Newton-Raphson step, 
the iterative linear solvers, and the preconditioning. Thus one can experi- 
ment with many different linear solvers and different preconditioncrs to find 
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an efficient combination. The code will also automatically benefit from im- 
provements to these software packages. 

• The code is dimension independent (up to three dimensions). 

• Many properties of a particular solve can be chosen at runtime, for example 
the domain decomposition, the mappings used in each subdomain, as well as 
the choice of the iterative solver. The user can also choose among differential 
operators and boundary conditions previously coded at runtime. 

In the next section we recall some basics of the pseudo-spectral collocation 
method. In section || we describe our novel approach of combining matching with 
solving of the PDE. For ease of exposition, we interweave the new method with 
more practical issues like mappings and code modularity. The central piece of 



our method, the operator S, is introduced in section 3.3 for a one-dimensional 
problem and then extended to higher dimensions and spherical shells. In section 
|] we solve three example problems. Many practical issues like preconditioning 
and parallelization are discussed in this section, and we also include a detailed 
comparison to a finite difference code. 



2. SPECTRAL METHODS 

We deal with a second order nonlinear elliptic partial differential equation or 
system of equations, 

(Afu)(x) =0, x e V, (1) 

in some domain D C R d with boundary conditions 

g(u)(x) = 0, x S dV. (2) 

The function u can be a single variable giving rise to a single PDE, or it can be 
vector-valued giving rise to a coupled set of PDEs. Throughout this paper we 
assume that the problem has a unique solution. We also absorb a possible right- 
hand side into the elliptic operator AT. 

The fundamental idea of spectral methods is to approximate the solution to the 
PDE (0) as a series in some basis functions 

N 

u(^/»(i)e^«A(4 (3) 

The coefficients are called the spectral coefficients. The power of spectral meth- 
ods stems from two simple facts: 

1. When approximating a smooth function with a series (^), the error of the 
approximation decreases exponentially with the number of basis functions N . 
Hence u^ N > will converge toward the true solution u of the PDE exponen- 
tially, provided u is smooth and one can determine the spectral coefficients 
Uk sufficiently well. 

2. One can evaluate the action of M on the function vS N ' exactly (i.e. up to 
numerical round-off) . This fact allows one to find the Uk accurately. 
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The second fact arises because the derivatives of the basis functions are known 
analytically, and so by Eq. (||), 

fc=0 

and similarly for higher derivatives. 

In order to compute the spectral coefficients u k we use pseudo- spectral colloca- 
tion where one requires 

(JVu {N) ){x l ) = 0, £ = 0,...,JV. (5) 

The points Xi are called collocation points, and are chosen as the abscissas of the 
Gaussian quadrature associated with the basis set This choice of collocation 
points can be motivated by considering J Uj\fu^ N ^)(x)] dx. Evaluating this inte- 
gral with Gaussian quadrature, we find by virtue of Eq. (|^) 

2 N 2 

(AA/W) (x) dx^Y. w * [(^ u(N) ) 0*) = 0, ( 6 ) 

i=0 



where Wi are the weights of the quadrature. We see that Afu^ must be small 
throughout V and thus the function u^ N ^ satisfying Eqs. (||) must be close to the 
true solution. More rigorous treatments of the pseudospectral collocation method 
can be found in the literature [fia B, pfl. 



2.1. Chebyshev polynomials 

Chebyshev polynomials are widely used as basis functions for spectral methods. 
They satisfy the convenient analytical properties of "classical" orthogonal polyno- 
mials. Their defining differential equation is a singular Sturm-Liouville problem, 
and so Chebyshev expansions converge exponentially for smooth functions u inde- 
pendent of the boundary conditions satisfied by u [|l5| |) . 

Chebyshev polynomials are defined by 

T k (X) = cos(fcarccosX), Ie[-l,l]. (7) 

They are defined on the interval X G [ — 1 , 1] only; usually one needs to map the 
collocation coordinate X e [—1,1] to the physical coordinate of the problem, x £ 
[a, b]. We use the convention that the variable X varies over the interval [—1,1], 
whereas x is defined over arbitrary intervals. We will describe our approach to 
mappings below in the implementation section. 

For an expansion up to order N (i.e. having a total of JV" + 1 basis functions) 
the associated collocation points are 

(iir \ 
-J, i = 0,...,N. (8) 

Define the real space values 

N 

u i = u^(X i )=J2u k T k (X i ). (9) 

fe=0 
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Using the discrete orthogonality relation 



2 N 1 

i=0 



with 



2 k = or k = N 

1 jfc= 1,... ,JV- 1, ( ^ 



we can invert (O) and find 



Both matrix multiplications (|^) and ( |l2|) can be performed with a fast cosine 
transform in 0(N In AT) operations, another reason for the popularity of Chebyshcv 
basis functions. 

There are the same number of real space values Ui and spectral coefficients Uk, 
and there is a one-to-one mapping between {ui} and {ufc}. Hence one can represent 
the function by either {ui} or 

The spectral coefficients of the derivative, 

^ r (X) = ^u' k T k (X), (13) 

fe=0 

are given by the recurrence relation 

Ui = u' l+2 + 2(i + l)u i+ i, i = 1, . . . , N - 1, 



"0 = 2 fi 2 + u l) 

with mat+i = utv = 0. The coefficients of the second derivative, 



(14) 



72 (JV) N ~ x 

a L^_ {X )=Y,u'lT k {X), (15) 

fe=0 



are obtained by a similar recurrence relation, or by applying (p_4J) twice. 



2.2. Basis functions in higher dimensions 

In higher dimensions one can choose tensor grids of lower dimensional basis 
functions. For example, a <i-dimensional cube [— 1, l] d can be described by Cheby- 
shev polynomials along each coordinate axis. For a three-dimensional sphere or 
spherical shell, tensor products of spherical harmonics for the angles and a Cheby- 
shev series for the radial coordinate |l6| are used. It is also possible to expand the 
angular piece in a double Fourier-series pl| . 
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2.3. Domain Decomposition 

If the computational domain T> has a different topology than the basis functions, 
then an expansion in the basis functions cannot cover T> completely. Moreover, the 
particular problem at hand might require different resolution in different regions of 
the computational domain which will render a single overall expansion inefficient. 

One circumvents these problems with domain decomposition. The computa- 
tional domain T> is covered by N-p subdomains 

V=\JV^ (16) 



each having its own set of basis functions and expansion coefficients: 

\x) = J2rt ) ^ ) (x), x&V^, » = 1,...N V . (17) 



u 

k=0 



Here denotes the approximation in the /i-th. domain, and we have dropped the 
additional label N denoting the expansion order of u^' . The individual subdomains 
T>n can touch each other or overlap each other. To ensure that the functions 
— each defined on a single subdomain T> M only — actually fit together and form a 
smooth solution of the PDE (|l|) on the full domain V, they have to satisfy matching 
conditions. In the limit of infinite resolution, we must have that 

• for touching subdomains T>^ and T> v the function and its normal derivative 
must be smooth on the surface where the subdomains touch: 

u M {x) = u {v \x) 

9u (m) xedv^ndv, (18) 

dn ^ ^ dn ^ ^ 

(The minus sign in the second equation of (|lj) occurs because we use the 
outward-pointing normal in each subdomain.) 

• for overlapping subdomains T>^ and T> v the functions and must be 
identical in T>^ n T) v . By uniqueness of the solution of the PDE, it suffices to 
require that the functions are identical on the boundary of the overlapping 
domain: 

u^(x) =u^(x), i6 9(P„nB„). (19) 

We will see in the next section how these conditions are actually implemented 
in the code. 

3. IMPLEMENTATION 

In this section we describe our specific approaches to several aspects of multi- 
dimensional pseudo-spectral collocation with domain decomposition. 
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3.1. One-dimensional Mappings 

Chebyshev polynomials are denned for X G [—1,1]. Differential equations in 
general will be defined on a different interval x S [a, b]. In order to use Chebyshev 
polynomials, one introduces a mapping 

X: [a, &]-»■ [-1,1], x^X = X(x) (20) 

that maps the physical coordinate x onto the collocation coordinate X. 

One could explicitly substitute this mapping into the PDE under consideration. 
Derivatives would be multiplied by a Jacobian, and we would obtain the PDE on 
the interval [—1,1]. For example, the differential equation in the variable x 

3 2 u 

— + u = 0, xe[a,b], (21) 
becomes the following differential equation in the variable X: 

X ' 2 ^ + X "§x- + U = > le| - ul - (22) 

where X' = dX/dx and X" = d 2 X/dx 2 . Now one could expand u{X) in Chebyshev 
polynomials, compute derivatives d/dX via the recurrence relation ( |l4j ) and code 
Eq. ( g2|) in terms of du/dX. This approach is common in the literature ||, |l7j| . 
However, it has several disadvantages: As one can already see from this simple 
example, the equations become longer and one has to code and debug more terms. 
Second, and more important, it is inflexible, since for each different map one has 
to derive and code a mapped equation (p2]). A priori one might not know the 
appropriate map for a differential equation, and in order to try several maps, one 
has to code the mapped equation several times. Also, for domain decomposition, a 
different map is needed for each subdomain. 

We propose a different approach. We still expand in terms of Chebyshev poly- 
nomials on X G [—1,1] and obtain the physical solution via a mapping X{x), 

N 

u(z) = ^6 fc T fe (X(x)), (23) 

fe=0 

and we still compute du(X) / dX and d 2 u(X)/dX 2 via the recurrence relation (H). 
However, now we do not substitute du(X)/dX and d 2 u(X) / dX 2 into the mapped 
differential equation, Eq. (|2^). Instead we compute first numerically 

~dx~- X ^X~ (24) 
d 2 u(x) _ vl 2 d 2 u{X) „ du(X) 

-dx^- X ~^X^ +X ~~dX~ (25) 

and substitute these values into the original physical differential equation (^lj) . The 
collocation points are thus mapped to the physical coordinates 

H =X~ 1 (X i ). (26) 

This approach separates the code into three distinct parts: 
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1. Code dealing with the basis functions: transforms between collocation space 
X and spectral space, evaluation of derivatives d/dX via recurrence relations. 
This code depends only on the collocation coordinates X G [—1,1] (and on 
the angular coordinates 6, <fi for spherical shells). 

2. Mappings that map between collocation coordinate X and physical coordi- 
nates x. 

3. The "user code" implementing the physical PDE [in our example Eq. (^jj)] 
that deals only with the physical coordinate x. 

These three elements are independent of each other: 

• A user who wants to code another differential equation has only to write the 
code that evaluates the differential operator M in the physical space with 
physical derivatives. Then immediately all previously coded mappings are 
available for this new differential equation, as well as all basis functions. 

• In order to introduce a new mapping, one has to code only four functions, 
namely X(x), its inverse x(X), as well as the derivatives X'(x) and X"(x). 
This new map can then be used for any differential equation already coded 
or to be coded later. 

• In order to switch to a different set of basis functions, one has only to code 
the transforms and the recurrence relations for the derivatives. 

In practice we use three different mappings 

linear: X(x) = Ax + B 

log: X(x) = A \og{Bx + C) (27) 
inverse: X(x) — — + B 

In each case the constants A, B are chosen such that [a, b] is mapped to [—1,1]. The 
log mapping has one additional parameter which is used to fine-tune the relative 
density of collocation points at both ends of the interval [a, b]. We show the effects 



of different mappings in our first example in section 4.1 



3.2. Basis functions and Mappings in higher Dimensions 

3.2.1. Rectangular Blocks 

In order to expand in a d-dimensional rectangular block, 

V = [ai,6i] x [02,62] x ... x [a d ,b d ], (28) 

we use a tensor product of Chebyshev polynomials with a l-d mapping along each 
coordinate axis: 

Ni N 2 N d 

u( Xl ,... ,x d )= ]T ^■•■^i tl ..,A(x (1) (ii))-T fcd (lMw). (29) 

fc 1= 0fc 2 =0 fe d =0 

We use d mappings 

X^ : [ai,h] -+ [-1,1], l = l,.-.d, (30) 
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and the collocation points in physical space are the mapped collocation points along 
each dimension, 

x ii—id = (^ii i • " '• E L'')' 

where the coordinate along the i-th dimension xf^ is given by Eq. (|2^) using 

Note that such a (i-dimensional rectangle has as many spectral coefficients 
Uk-L—kd as g r id point values u^...^ = u(xi 1) ... ,Xi d ). Therefore we can equiva- 
lently solve for the spectral coefficients or the real space values. We will solve for 
the real space values Ui 1 ...i d . 



3.2.2. Spherical Shell 

In a spherical shell with inner and outer radii < R\ < R2 we use a mapping 
for the radial coordinate. A function u(r, 9, fa) is thus expanded as 

Nr L I 

u(r,9,d>) = J2Yl E «HmT fc (X(r))y Im (M), (32) 

k=0 1=0 m=-l 

where real- valued spherical harmonics are used: 

v fa a\ - / p i m {cos6)cos(m^), m>0 
Yi m (0,<j)) = < i m | . 33) 

[_ P] (cos 0) sin(|m|0), m<0 

Pj m (cos#) are the associated Legendre polynomials. Associating the sin-terms with 
negative m is not standard, but eliminates the need to refer to two sets of spectral 
coefficients, one each for the cos-terms and the sin-terms. The radial mapping 



X : [Ri, R2] — * [—1,1] can be any of the choices in Eq. (27). The radial collocation 
points ri,i = 0, . . . ,N r are given by Eq. (|26|). 



For the angle <ft, Eq. (|33| ) leads to a Fourier series with equally spaced azimuthal 
collocation points 

2ni 

<t>i = ir , i = O,l,...,JV -l. (34) 

There is a total of No = L + 1 angular collocation points which are the abscissas 
of Gauss-Legendre integration. 

We employ the software package Spherepack|lJ which provides routines to com- 
pute the collocation points, transforms and angular derivatives. 

In order to resolve the full Fourier series in <fi up to m — L, one needs N<j, > 
2L + 1, since for N$ — 2L, the term sin(L^) vanishes at all collocation points fa. 
We use = 2(L + 1) since FFTs are more efficient with an even number of points. 

The expansion ( |32"| ) has a total of (N r + 1)(L + l) 2 spectral coefficients but a to- 
tal of (N r + l)iV e A^ = 2(A^ r + 1){L + 1) 2 collocation points. This means a spherical 
shell has more collocation points than spectral coefficients and the expansion ( |32"| ) 
approximates the grid point values in a least-square sense onlyQ. Performing a 
spectral transform and its inverse will thus project the grid point values into a sub- 
space with dimension equal to the number of spectral coefficients. The implications 



of this fact for our code are discussed below in section 3.6 
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FIG. 1 Illustration of matching with three subdomains in one dimension. Subdo- 
mains 1 and 2 touch each other, and subdomains 2 and 3 overlap, denotes the 
coordinate of the i-th collocation point of the [i-th subdomain, and denotes 
the function values at the grid points. 



3.2.3. Representation of vectors and derivatives 

In both kinds of subdomains, rectangular blocks and spherical shells, we ex- 
pand the Cartesian components of vectors, and we compute Cartesian derivatives, 
d/dx,d/dy,d/dz. These quantities are smooth everywhere, and thus can be ex- 
panded in scalar spherical harmonics. By contrast, the spherical components of a 
vector field in a spherical shell are discontinuous at the poles and cannnot be ex- 
panded in scalar spherical harmonics. One would have to use e.g. vector spherical 
harmonics |2^, |2^| . 

For a spherical shell we provide an additional wrapper around the basis func- 
tions and the radial mapping that transforms polar derivatives d/dr,d/d9,d/d<j) 
to Cartesian derivatives. This involves multiplications by sines and cosines of the 
angles 0, </> which can be performed grid point by grid point in real space. Alterna- 
tively, it can be done spectrally by expressing, e.g. sinf? in spherical harmonics, and 
then using addition theorems to reduce products of spherical harmonics to simple 
sums. Carrying out the transformation spectrally is slightly better in practice. 

Representing vectors and derivatives in Cartesian coordinates in both kinds of 
subdomains increases flexibility, too. We can use the same code to evaluate the 
residual in both kinds of subdomains. 

3.3. The operator S 

We now introduce the operator 5, the centerpiece of our method. It combines 
the solution of the PDE, the boundary conditions and matching between different 
subdomains. 

We introduce S first with a simple case, a one-dimensional differential equation 
with a Dirichlet boundary condition at one end and a von Neumann boundary 
condition at the other end: 
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{Afu)(x) 


= 0, 


u(a) 


= A, 






> 


= B. 



a < x <b, (35) 

(36) 

(37) 

To explain our procedure for matching, we assume three domains as depicted in 
Figure [j]. denotes the highest retained expansion order in domain fi; here 
= 3 for all domains. Domains 2 and 3 overlap. Domains 1 and 2 touch so 
that the collocation points x^ and represent the same physical point. The 

function value, however, is represented twice, once assigned to domain 1, M3 , 

(2) 

and once belonging to domain 2: Uq ' . Using just the grid point values within 
one subdomain, we can expand the function in that subdomain and can evaluate 
derivatives. We can also interpolate the function to arbitrary positions x. Thus, 
given values {uf\i = 0, . . . , N^}, we can compute u^(x) for x e [xq , x n ]■ 

In order to determine the unknowns uj , we need one equation per unknown. 
We will first write down these equations and then explain where they come from. 

(AfuM)(x^) =0, i = l,... ,N„-1, f i = l,...,N v (38a) 
u£ ) - A = (38b) 

dui3 \x^)-B = (38c) 



dx 



4 X) - «o 2) = ° ( 38d ) 



2) 



4 2 ) - u (3)(4 2 )) =0 (38f) 

4 3) -« (2) (4 3) )=0 (38g) 
Eq. ( |38a[ ) represents the actual pseudo-spectral equation (|^). It is enforced only 



for c olloc ation points that are not on the boundary of a subd omai n. Eqs. (38b) 



and (38c) encode the boundary conditions. Eqs. ( 38d ) and (38e) describe the 
value and derivative matchi ng a t touc hing subdomain boundaries. These equations 
follow from Eq. (|l8|). Eqs. (|38|) and ( 38g ) perform matching between overlapping 



subdomains as given by Eq. ([191). 

We will view the left-hand side of Eqs. (|3|) as a non-linear operator S. This 
operator acts on the set of grid point values for all subdomains {w^' 1 } (m — 
1, 2, 3, i = 0, . . . , in the example) and returns a residual that incorporates the 
actual pseudo-spectral condition Eq. (||), the boundary conditions, and the match- 
ing conditions between different subdomains. If we denote the vector of all grid 
point values by u, then the discretized version of the partial differential equation 
becomes 

Su = 0. (39) 

The solution of Eq. ( |39| ) clearly is the solution of the partial differential equation 
we want to obtain. By virtue of Eq. ( |39] ) we thus have condensed the PDE, the 
boundary conditions and matching into one set of nonlinear equations. 
We comment on some implementation issues: 
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• The action of the operator S can be computed very easily: Given grid point 
values u, every subdomain is transformed to spectral space and derivatives 
are computed. Using the derivatives we can compute Eqs. (38a), (38e) and 



any boundary conditions that involve derivatives like Eq. (|38cj) . The interpo- 
lations necessary in Eqs. (381) and (38g) are done by summing up the spectral 
series. 

• Su can be computed in parallel: Everything except the matching conditions 
depends only on the set of grid point values within one subdomain. Therefore 
the natural parallclization is to distribute subdomains to different processors. 

• The code fragments implementing the nonlinear operator AA, the boundary 
conditions and the matching conditions are independent of each other. In 
order to change boundary conditions, one has only to modify the code im- 
plementing Eqs. (38b) and (38c). In particular, the code for the matching- 
equations ( |38cj )-( p8g[ ) can be used for any differential operator Af and for any 
boundary condition. 

We have now introduced the operator S in one dimension. Next we address how 
to solve Eq. (^), and then we generalize S to higher dimensions. We present our 
method in this order because the generalization to higher dimensions depends on 
some details of the solution process. 



3.4. Solving Su — 

In this section we describe how we solve the system of nonlinear equations ( |39|) . 
Our procedure is completely standard and requires three ingredients: A Newton- 
Raphson iteration to reduce the nonlinear equations to a linear solve at each iter- 
ation, an iterative linear solver, and the preconditioner for the linear solver. For 
these three steps we employ the software package PETScQ. We now comment on 
each of these three stages. 



3.4-1- Newton-Raphson with line searches 

PETSc[[I] implements a Newton-Raphson method with line searches, similar to 
the method described in p5[ . Given a current guess of the solution, a Newton- 
Raphson step proceeds as follows: Compute the residual 

L = ^Mold (40) 
and linearize S around the current guess of the solution: 

Js—Cy^). (41) 

The Jacobian J is a Ndf x A^Df-dimensional matrix, Ndf being the number of 
degrees of freedom. Next compute a correction Su by solving the linear system 

JSu = -r. (42) 

Finally a line-search is performed in the direction of <5u. Parametrize the new 
solution by 

"new = Hold + A (5u (43) 
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and determine the parameter A > such that the residual of the new solution, 



l|S(u new )ll, (44) 

has sufficiently decreased. Of course, close enough to the true solution, the full 
Newton-Raphson step A = 1 will lead to quadratic convergence. PETSc offers 
different algorithms to perform this line-search. The line search ensures that in 
each iteration the residual does indeed decrease, which is not guaranteed in Newton- 
Raphson without line searches. 



3.4-2. Linear Solve 



In each Newton-Raphson iteration one has to solve Eq. p^), a linear system 
of Ndf equations. For large systems of linear equations, iterative linear solvers B 
are most efficient. Such iterative solvers require solely the ability to compute the 
matrix-vector product Jy_ for a given vector v. Since spectral derivatives and 
spectral interpolation lead to full (i.e. non-sparse) matrices it is impractical to set 
up the matrix J explicitly. One can compute these matrix-vector products instead 
with t he lin earized variant of the code that computes the operator S, i.e. equations 
(B8ah-(38g) and their multidimensional generalizations. Thus our method requires 
the lin eariz ations of t he operator AT [Eq. (38a)] a nd of t he b oundary conditions 
[Eqs. ( 38b ) and (|8q)]. The matching equations ( 38d )-(38g) are linear anyway, 
so one can reuse code from S for these equations. The linearizations are merely 
Frechet derivatives^ of the respective operators evaluated at the collocation points, 
and therefore the Newton-Raphson iteration applied to the discretized equations is 
equivalent to the Newton-Kantorovitch iteration applied to the PDE. 

PETSc includes several different linear iterative solvers (GMRES, TFQR, ...) 
that can be employed for the linear solve inside the Newton-Raphson iteration. 
The choice of linear solver and of options for the linear solver and for the Newton- 
Raphson iteration are made at runtime. This allows one to experiment with dif- 
ferent linear solvers and with a variety of options to find an efficient combination. 
Note that the matching conditions (|l^) and (|l^) lead to a nonsymmetric matrix 
J . Therefore only iterative methods that allow for nonsymmetric matrices can be 
used. 



3.4-3. Preconditioning 

In practice one will find that the Jacobian J is ill-conditioned and thus the 
iterative method employed to solve Eq. ( }4^ ) will need an increasing number of 
iterations as the number of collocation points is increased. The spectral condition 
number k of a matrix is the ratio of largest to smallest eigenvalue of this matrix, 



For second order differential equations discretized with Chebyshev polynomials, 
one finds k cx N 4 , N being the number of grid points per dimension. Solving a 
linear system to given accuracy will require Q [s| O(k) iterations of the Richardson 
method, and 0(^/k) iterations of modern iterative methods like conjugate gradients 
or GMRES. Although modern methods are better than Richardson iteration, it is 
still vital to keep k close to 1. 
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This is achieved with preconditioning. Instead of solving Eq. ( fl2| ) directly, one 
solves 



BJ Su = -Br, (46) 

with the preconditioning matrix B. Now the iterative solver deals with the matrix 
BJ . If B is a good approximation to J'" 1 , then BJ will be close to the identity 
matrix, the condition number will be close to unity, and the linear solver will 
converge quickly. 

Hence the problem reduces to finding a matrix B that approximates J -1 suf- 
ficiently well and that can be computed efficiently. There exist many different 
approaches, most notably finite difference preconditioning |2^| and finite element 
preconditioning |p| ; we will follow a two-stage process proposed by Orszag(2^]. 
First, initialize a matrix Afd with a finite difference approximation of the Jaco- 
bian J. Second, approximately invert Afd to construct B, 

B » App. (47) 

In one spatial dimension Afd is tridiagonal and direct inversion B = A FD is feasi- 
ble. In two or more dimensions, direct inversion of Afd is too expensive; for prob- 
lems in one two-dimensional subdomain, hardcoded incomplete LU-factorizations 
have been developed ||. In our case we have to deal with the additional complex- 
ity that the Jacobian and therefore Afd contains matching conditions. Since we 
choose the domain decomposition at runtime, nothing is known about the particular 
structure of the subdomains. 

We proceed as follows: We initialize Afd with the finite difference approxima- 
tion of J . It is sufficient to include those terms of the Jacobian in Afd that cause 
the condition number to increase with the expansion order. These are the second 
spatial derivatives and the first derivatives from matching conditions and boundary 



conditions, Eqs. (38e) and ( |38cj ) . Including the value matching conditions (38d), 
( |38lD , ( |38gD in Afd improves the ability of the preconditioner to represent modes 
extending over several subdomains and thus decreases the number of iterations, too. 



In the first example in section 4.1 we demonstrate that preconditioning is indeed 
necessary, and that one should precondition not only the second order derivatives, 
but also the matching conditions. Some details about the finite difference approxi- 
mations are given in appendix [a|. 

Having set up Afd we then use the software package PETSc [[I) for the approxi- 
mate inversion of Eq. ( |47|) . PETSc provides many general purpose preconditioners 
that perform the step fl47| ) either explicitly or implicitly, most notably ILU and the 
overlapping Schwarz method. With PETSc we can explore these to find the most 
efficient one. We will describe our particular choices for preconditioning below for 
each example. 



3.5. S in higher dimensions 

Generalizing S to multiple dimensions is conceptually straightforward, since 
Eqs. ( |38| ) generalize nicely to higher dimensions. In order to simplify the matching 
between touching subdomains, we require that on a surface shared by touching sub- 
domains, the collocation points are identical. If, for example, two three-dimensional 
rectangular blocks touch along the cc-axis, then both blocks must have identical 
lower and upper bounds of the blocks along the y and z axis and both blocks must 
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use the same mappings and the same number of collocation points along the y- and 
z-axis. For concentric spherical shells, this restriction implies that all concentric 
shells must have the same number of collocation points in the angular directions. 
With this restriction, matching between touching subdomains remains a point-by- 
point operation. 

For overlapping domains, no restriction is needed. If a boundary point of one 
sub dom ain happens to be within another subdomain, then an equation analogous 
to (381) is enforced using spectral interpolation. 

The actual implementation of the operator S involves bookkeeping to keep track 
of which subdomains overlap or touch, or of what equation to enforce at which grid 
point. When running in parallel, matching conditions have to be communicated 
across processors. We utilize the software package KeLP[||, which provides func- 
tionality to iterate over boundary points of a specific subdomain. It also provides 
routines to handle the interprocessor communication needed for the matching con- 
ditions. 



3.6. Extension of S to Spherical Shells 

Spherical shells have the additional complexity of having more collocation points 
than spectral coefficients, N co i > N spec , at least in our formulation. Transform- 
ing to spectral space and back to real space projects the real-space values into a 
iV spec -dimensional subspace. Since spectral transforms are used for derivatives and 
interpolation, a sphere has effectively only N spec degrees of freedom. If we naively 
try to impose N co i equations, one at each collocation point, and if we try to solve 
for real space values at each collocation point, we find that the linear solver does 
not converge. This happens because more equations are imposed than degrees of 
freedom are available. Thus we cannot solve for the real space values Uijk in a 
spherical shell. 

The next choice would be to solve for the spectral coefficients Ukim as defined 
in Eq. ( |32| ) This is also problematic as it prohibits finite-difference preconditioning. 
One finds guidance on how to proceed by considering the prototypical elliptic op- 
erator, the Laplacian. Application of V 2 to an expansion in spherical harmonics 
yields 



V 2 ^ai m {r)Y Un = J2 

l.rn l,m 



l{l + l)ai m (r) | 1 d f 2 da hn (r) 
r 2 r 2 dr \ dr 



Y lm . (48) 



We see that the different (Zm)-pairs are uncoupled. The angular derivatives will 
therefore be diagonal in spectral space (with diagonal elements — 1(1 + 1) /r 2 ). How- 
ever, one has to precondition the radial derivatives in order to keep the spectral 
conditioning number low and must therefore keep real-space information about the 
radial direction. We therefore solve for the coefficients uu m of an expansion defined 
by 

L I 

u(r,,M)=X; E uu m Y lm (e,<t>). (49) 

l=o m=-l 

This mixed real/spectral expansion has N spec coefficients uu m and retains real 
space information about the radial coordinate necessary for finite difference pre- 
conditioning. In order to precondition the flat space Laplacian in a spherical shell, 
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FIG. 2 Domain decomposition for Laplace equation in a square. The left plot 
illustrates linear mappings in all subdomains and the right plot shows log-linear- 
log mappings along each axis. 



Afd is initialized with the diagonal matrix diag(— 1(1 + l)/r?) for the angular piece 
of V 2 and with finite differences for the radial deriva tive s. More general differential 
operators are discussed in the last example, section 13, and in appendix [§]. 

In order to evaluate Su for a spherical shell, we proceed as follows, u contains 
the coefficients uu m . Transform these coefficients to real space values. This involves 
only an angular transform. Compute boundary conditions, matching conditions, 
and the residual of the nonlinear elliptic operator Af at each collocation point as in 
rectangular blocks. At this stage we have N co i collocation point values, all of which 
should vanish for the desired solution. We transform these values back into the 
coefficients of Eq. (|f9|) and return these coefficients as the residual of the operator 
S. 



4. EXAMPLES 

4.1. V 2 tt — in 2-D 

As a first test, we solve the Laplace equation in two dimensions with Dirichlct 
boundary conditions: 

V 2 u(x,y)=0 (x,y)eV (50) 

u{x,y)=hx{x 2 +y 2 ) (x,y)&dV (51) 

The computational domain T> is a square with side 2L centered on the origin with 
a smaller square with side 2 excised: 

V = {{x,y)\ -L<x,y<L}- {(x,y)\ -Kx,y<l} (52) 

This domain is decomposed into 8 touching rectangles as shown in figure ^|. This 
figure also illustrates the difference between linear mappings and logarithmic map- 
pings. The right plot of figure ^ shows that logarithmic mappings move grid points 
closer to the excised rectangle. For clarity, both plots neglect the fact that the 
Chebyshev collocation points given in Eq. (||) are clustered toward the boundaries. 
We solve Eqs. ( pC| ) and ( [5l| ) for three cases: 

• L = 5 with linear mappings 

• L = 100 with linear mappings 
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FIG. 3 rms-errors of solution of Laplace equation in a square. Stars denote L = 5 
with linear mappings, squares L — 100 with linear mappings, and circles L = 100 
with log mappings. 



• L = 100 with logarithmic mappings. 

Equation ( p0| ) is linear, therefore only one Newton- Raphson iteration with one linear 
solve is necessary. The numerical solution is compared to the analytic solution 
u(x, y) — ln(x 2 + y 2 ). The errors are shown in figure |^. In the small computational 
domain extending only to x,y — ±5, the accuracy of the solution quickly reaches 
numerical roundoff. In the larger domain extending to L = 100 the achievable 
resolution with the same number of collocation points is of course lower. With 
linear mappings we achieve an absolute accuracy of 10~ 4 with a total of ~ 10000 
collocation points. This is already better than finite difference codes. However 
this accuracy can be increased with better adapted mappings. Since the solution 
ln(x 2 +y 2 ) changes much faster close to the origin than far away, one expects better 
convergence if more collocation points are placed close to the origin. This can be 
achieved with logarithmic mappings. Figure || shows that logarithmic mappings 
roughly double the convergence rate. At the highest resolution the difference is 
over four orders of magnitude. 

Table | compares the number of iterations Na s in the linear solver for different 
choices of the finite difference preconditioning matrix Afd [section [3.4.2 ]. With- 



out any preconditioning, Afd = 1, Nu s increases very quickly with the number 
of collocation points. If only second derivative terms are included in Afd then 
Ni ts grows more slowly. The number of iterations becomes almost independent of 



resolution if both second derivatives and the matching conditions (38d) and (38e) 
are incorporated into Afd- In this case ILU(2) preconditioning completely con- 
trols the largest eigenvalue X max , and it is then the smallest eigenvalue \ m in that 
causes the number of iterations to increase. It is typical that ILU has difficulties 
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TABLE 1 

Number of iterations Nu s in the linear solver for various kinds of preconditioning 
as a function of the number of collocation points N per dimension per subdomain. 
no PC stands for no preconditioning at all, PC V 2 includes only second derivative 
terms in Afd, whereas full PC includes also matching conditions. ILU(2) 
decomposition, exact and approximate inversion of Afd are explored. 
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controlling the long wavelength modes, and the problem is aggravated because the 
subdomains are only weakly coupled. Table [l] also contains results for exact inver- 
sion of Afd- Exact inversion controls \ m im too, and now the number of iterations 
is independent of resolution. However, direct inversion is an expensive process and 
only possible for small problems like this one. Finally, the table also contains the 
iteration count for approximate inversion of Afd, which is our preferred method 
for more complex geometries in 3 dimensions. It will be explained in detail in the 
next example. The eigenvalues in table [l] are estimates obtained by PETSc during 
the linear solve. 



4.2. Quasilinear Laplace equation with two excised spheres 

This solver was developed primarily for elliptic problems in numerical relativity. 
Accordingly we now solve an equation that has been of considerable interest in that 
field over the last few years (see e.g. Jl0|, ^ and references therein). Readers not 
familiar with relativity can simply view this problem as another test example for 
our new solveij^. We solve 

V 2 t/> + ^A 2 ^- 7 = (53) 
8 

for the function ip — ip(x, y, z). A 2 — A 2 (x, y, z) is a known, positive function, and 
the computational domain is R 3 with two excised spheres, 

V = R 3 - Si — S 2 . (54) 

1 The solution of this problem describes two black holes. The surfaces of the spheres Si : 2 
correspond to the horizons of the black holes, the function A 2 encodes information about spins 
and velocities of the black holes, and the solution tp measures the deviation from a flat spacetime. 
Far away from the black holes one has if) fa 1 with an almost Minkowski space, close to the holes 
we will find ifi ~ 2 with considerable curvature of spacetime. 
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FIG. 4 Cut through the domain decomposition for the computational domain 

©■ 

The radii r 1 / 2 and centers of the spheres are given. The function tp must satisfy 
a Dirichlet boundary condition at infinity, and Robin boundary conditions at the 
surface of each excised sphere: 

tp — > 1 as r — ► oo (55) 

%+tr Q " £dSi ' i=1 ' 2 (56) 

d/dr in Eq. (^6|) denotes the radial derivative in a coordinate system centered at 
the center of sphere i. 

Figure ^ sketches the domain decomposition used for the computational domain 
T>. We surround each excised sphere with a spherical shell. These two spherical 
shells are matched together with 5x3x3 rectangular blocks, where the two blocks 
that contain the excised spheres Si 2 are removed. Finally, we surround this struc- 
ture with a third spherical shell extending to very large outer radius. This gives a 
total of 46 subdomains, namely 3 shells and 43 rectangular blocks. 

In the inner spheres we use a log mapping for the radial coordinate. In the 
rectangular blocks, a combination of linear and logarithmic mappings is used similar 
to the 2D example in figure 0. In the outer sphere an inverse mapping is used which 
is well adapted to the fall-off behavior tp ~ 1 + ar _1 + • • • for large radii r. The 
outer radius of the outer spherical shell is chosen to be 10 9 or 10 10 and a Dirichlet 



boundary condition tj) = 1 is used to approximate Eq. (55) 



We now present two solutions with different sizes and locations of the excised 



spheres. In sections 4.2.3 to 4.2.6, we then discuss several topics including precon- 



ditioning and parallclization. 



4-2.1. Equal sized spheres 



First we choose two equal sized spheres with radii r% = ri = 1. The separation 
between the centers of the spheres is chosen to be 10, the outer radius of the outer 
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FIG. 5 Convergence of solution of (p3|)-(p6|) with radii of excised spheres r\ = r 2 = 
1 at separation 10. Nop, Lj n /, L%, M and AM are defined in the text immediately 
before and after Eq. (j57[). 



sphere is 10 9 . 

We solve equation ( |53| ) at several resolutions. The highest resolution uses 29 3 
collocation points in each rectangular block, 29 x 21 x 42 collocation points (radial, 
9 and <f> directions) in the inner spherical shells and 29 x 16 x 32 in the outer 
spherical shell. We use the difference in the solutions at neighboring resolutions as 
a measure of the error. We denote the pointwise maximum of this difference by 
L in f and the root-mean-square of the grid point values by L 2 - We also compute at 
each resolution the quantity 

which is the total mass of the binary black hole system. M will be needed in the 
comparison to a finite difference code below. The difference AM between M at 
neighboring resolutions is again a measure of the error of the solution. 

Figure shows the convergence of the solution tp with increasing resolution. 
Since the rectangular blocks and the spheres have different numbers of collocation 

1 /3 

points, the cube root of the total number of degrees of freedom, Njj F is used to label 
the x-axis. The exponential convergence is apparent. Because of the exponential 
convergence, and because Linf, L2 and AM utilize differences to the next lower 
resolution, the errors given in figure || are essentially the errors of the next lower 
resolution. Note that at the highest resolutions the approximation of the outer 
boundary condition ( fi5| ) by a Dirichlet boundary condition at finite outer radius 
10 9 becomes apparent: If we move the outer boundary to 10 10 , M changes by 
2 ■ 10 -9 which is of order 1/10 9 as expected. 

On the coarsest resolution ip = 1 is used as the initial guess. Newton-Raphson 
then needs six iterations to converge. On the finer resolutions we use the result of 
the previous level as the initial guess. These initial guesses are so good that one 
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Newton-Raphson iteration is sufficient on each resolution. 



4-2.2. Nonequal spheres — Different length scales 

With the multidomain spectral method it is possible to distribute resolution 
differently in each subdomain. This allows for geometries with vastly different 
length scales. As an example, we again solve equations (|5^)-(|5?j). The radii of the 
spheres are now r\ = 1 and r2 = 0.05, and the separation of the centers of the 
spheres is 100. The separation of the holes is thus 2000 times the radius of the 
smaller sphere. A finite difference code based on a Cartesian grid for this geometry 
would have to use adaptive mesh refinement. 

With the spectral solver, we still use the domain decomposition depicted in 
figure ||, but now the inner radii of the two inner spherical shells are different. The 
outer boundary of the outer sphere is at 10 10 . The number of collocation points in 
each subdomain is almost identical to the case with equal sized spheres of figure ^|, 
except we add 8 additional radial collocation points to the shell around the small 
excised sphere. As before we solve on different resolutions and compute the norms 
of the differences of the solution between different resolutions, as well as of the total 
mass M. The results are shown in figure ^|. The exponential convergence shows 
that the solver can handle the different length scales involved in this problem. 



4-2.3. Preconditioning 

The finite difference approximation Afd is initialized with the second deriva- 
tive terms, the matching conditions in touching domains [cf. Eqs. (38d) and ( |38e| )1, 
and with a FD approximation of the Robin boundary condition Eq. (|56|). Run- 
ning on a single processor, we could again define the preconditioner B via an ILU 
decomposition of Afd- However, when running on multiple processors, an ILU 
decomposition requires a prohibitive amount of communication, and block ASM 
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preconditioning]^ with one block per processor becomes favorable. After consid- 
erable experimentation, we settled on an implicit definition of B via its action on 
vectors. By_ is defined to be the approximate solution w of 

-4fdw = v. (58) 

Equation ( |58| ) is solved using a second, inner iterative solver, usually GMRES 
preconditioned with ILU (on a single processor) or a block ASM method (on mul- 
tiple processors). The inner solver is restricted to a fixed number of iterations. 
Applying a fixed number of iterations of an iterative solver is not a linear oper- 
ation, hence B represents no longer a matrix, but a nonlinear operator. In the 
outer linear solve we therefore use FGMRES[^6), a variant of GMRES that does 
not require that the preconditioner B is linear. With this preconditioning the outer 
linear solver needs about 20 iterations to reduce the residual of the linear solve by 
1CT 5 . 

More inner iterations reduce the number of iterations in the outer linear solve, 
but increase the computations per outer iteration. We found the optimal number 
of inner iterations to be between 15-20. In all the computations given in this paper 
we use 20 inner iterations, except for the 2-D example in table [I] where 10 inner 
iterations sufficed. 



4.2.4. Multigrid 

We also experimented with multigrid algorithms^, to improve the runtime. 
The potential for multigrid is fairly small, since the number of collocation points is 
so low. In this particular problem, an accuracy of better than 10 -6 can be achieved 
with 17 3 grid points per domain, which limits multigrid to at most two coarse levels. 

In addition it is not trivial to construct a restriction operator. The obvious and 
canonical choice for a restriction operator is to transform to spectral space, discard 
the upper half of the spectral coefficients, and transform back to real space on a 
coarser grid. This does not work here because the operator S uses the boundary 
points of each subdomain to convey information about matching between subdo- 
mains and about boundary conditions. Since these boundary points are filled using 
different equations than the interior points, the residual will typically be discon- 
tinuous between boundary points of a subdomain and interior points. Information 
about discontinuities is mainly stored in the high frequency part of a spectral ex- 
pansion and discarding these will thus result in a loss of information about matching 
between grids. However, the coarse grid correction of a multigrid algorithm is sup- 
posed to handle long wavelength modes of the solution. In our case these extend 
over several subdomains and thus information about matching is essential. Hence 
the simple approach of discarding the upper half of the frequencies discards the 
most vital parts of the information required by the coarse grid solver. 

Thus one seems to be compelled to use a real space restriction operator. Wc 



examined straight injection|25| which performed fairly well. The execution speed 



was comparable to the preconditioning with an inner linear solve as described in 



section 4.2.2. Since we did not achieve a significant code speed-up, there was no 



reason to keep the increased complexity of the multigrid algorithm. 



4-2.5. Comparison to a Finite Difference Code 

The computational domain Eq. (^4|) is challenging for standard finite difference 
codes based on a regular Cartesian grids for two reasons: 
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FIG. 7 Comparison of runtime vs. achieved accuracy for the new spectral solver 
and the highly optimized Cadez code. Plotted is the achieved accuracy of the total 
mass M vs. runtime needed to solve Eqs. (|53|)-(|56|) for both codes. 



1. The boundaries of the excised spheres do not coincide with coordinate bound- 
aries, so complicated interpolation/extrapolation is needed to satisfy the bound- 
ary condition ( |5^ ) (This problem led to a reformulation of the underlying 
physical problem without excised spheres 0). 

2. Resolving both the rapid changes close to the excised spheres and the fall-off 
behavior toward infinity requires a large number of grid points. 

In jlO^three different methods were developed to solve the boundary value 



problem (5^)-(p6[). The best one turned out to be a finite difference code based on 
a specially adapted coordinate system, the so-called Cadez coordinates. This code 
is an FAS-multigrid algorithm developed specifically for the differential equation 
(|53|). Care was taken that the truncation error is strictly even in grid-spacing 
h, thus allowing one to take two or three solutions at different resolutions and 
Richardson extrapolate to h — ► 0. The Cadez code is thus specially built for this 
equation in this geometry and it is unlikely that it can be significantly improved 
upon by any finite difference method. 

On the other hand, our spectral solver is general purpose. The domain decom- 
position is not restricted to R 3 with two excised spheres and we do not employ any 
specific optimizations for this particular problem. 

We compare these two codes for the configuration with equal sized spheres. 
Figure ^ shows a plot of runtime vs. achieved accuracy for both codes. These 
runs were performed on a single RS6000 processor; the highest resolution runs 
needed about 1GB of memory. The solid line labeled FD represents the results 
of the finite difference code without Richardson extrapolation. This line converges 
quadratically in grid spacing. The two stars represent Richardson extrapolated 
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values. The superiority of the spectral code is obvious. In accuracy, the spectral 
method outperforms even the finite difference code with Richardson extrapolation 
by orders of magnitude. Only very few finite difference codes allow for Richardson 
extrapolation, hence one should also compare the finite difference code without 
Richardson extrapolation to the spectral code: Then the lowest resolution of the 
spectral code is as accurate as the highest resolution of the finite difference code 
and faster by a factor of 20. Note also that the Cadez code cannot handle excised 
spheres of very different sizes or spheres that are widely separated. In particular, 
it cannot be used for the configuration in section 4.2.2, which is readily solved by 
our method. 



4-2.6. Parallelization 

Most computations during a solve are local to each subdomain; the operator S 
and the Jacobian J need communicate only matching information across subdo- 
mains. The inner linear solve is a completely standard parallel linear solve with 
an explicitly known matrix Afd- The software package PETSc has all the nec- 
essary capabilities to solve this system of equations efficiently in parallel. Hence 
parallelization by distributing different subdomains to different processors is fairly 
straightforward. 

However, different elements of the overall solve scale with different powers of 
the number of collocation points per dimension. If we denote the number of collo- 
cation points per dimension by N, the following scalings hold in three dimensions 
(the most interesting case): A spectral transform in a rectangular domain requires 
0(N 3 log N) operations; the transform in a sphere — where no useful fast transform 
for the Legendre polynomials is available — requires 0(N 4 ) operations; interpola- 
tion to one point is 0(N 3 ), so interpolation to all 0(N 2 ) boundary points scales 
like 0(N 5 ). Thus the optimal assignment of subdomains to processors is a function 
of N. Moreover, assignment of subdomains to processors is a discrete process — it 
is not possible to move an arbitrary fraction of computations from one processor to 
the another. One always has to move a whole subdomain with all the computations 
associated with it. This makes efficient load balancing difficult. 

At high resolution, the 0(N 5 ) interpolation consumes most of the runtime. Note 
that the outer spherical shell interpolates to 78 block surfaces, whereas the inner 
shells each interpolate to 6 block surfaces. These interpolations are parallelized by 
first distributing the data within each sphere to all processors. Then each processor 
interpolates a fraction of the points and the results are gathered again. 

We present scaling results in table 0. These results were obtained on the SP2 
of the physics department of Wake Forest University, and on NCSA's Platinum 
cluster, whose nodes have two Intel Pentium processors each. The listed times 
are cumulative times for solving at five different resolutions, each solve using the 
next lower solution as initial guess. Not included in these times is the set up in 
which the code determines which subdomain is responsible for interpolating which 
"overlapping" boundary point. Also not included is input/output. 

On the SP2 we achieve a scaling efficiency of 75%, whereas the Intel cluster 
has a lower scaling efficiency between around 54% (8 processors), and 41% (46 
processors). Given all the limitations mentioned above these numbers are very 
encouraging. 

Changing from serial to parallel execution degrades performance in two ways: 
First, the ILU preconditioner used within the approximate inner linear solve is 
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TABLE 2 

Runtime and scaling efficiency. Three processors host one shell and n\ blocks 
each, the remaining processors host ni blocks each. The last four columns refer to 

the Platinum cluster. 
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replaced by an overlapping block ASM preconditioner. Since this preconditioner is 
less efficient than ILU, the approximate inner linear solve is less accurate after its 
fixed number of iterations. Therefore the outer linear solve needs more iterations 
to converge to the required accuracy of 10~ 5 . The single processor code needs 19 
outer linear iterations, whereas the parallel codes need 23 or 24. Thus the maximally 
achievable scaling efficiency is limited to 19/23 ~ 0.83. The scaling efficiency on 
the SP2 is close to this limit. 

The second reason for low scaling efficiency is that we have not optimized the 
MPI calls in any way. The fact that the scaling efficiency on the cluster is much 
better if only one processor per node is used, suggests that the MPI calls are a 
bottleneck. Using both processors on a node doubles the communication load on 
that node which doubles the waiting time for MPI communication. The higher 
scaling efficiency on the SP2 which has faster switches also suggests that the runs 
on the PC cluster are communication limited. 

4.3. Coupled PDEs in nonflat geometry with excised spheres 

So far we have been considering only PDEs in a single variable. However, the 
definition of the operator S is not restricted to this case. In this section we present 
a solution of four coupled nonlinear PDEs. These equations are 



V 2 V - \i/>R - ^ 5 K 2 + A H Aij = °. ( 59 ) 

i = 1,2, 3 (60) 



A L V l - -^V'K + VjM ij = 0, 

3 3=1 



These equations are important for the binary black hole problem. The exact defi- 
nitions of the various terms can be found in 23 . For this paper, only the following 
information is necessary: V 2 is the Laplace operator on a nonflat three-dimensional 
manifold, hence Eq. (|59f) is an elliptic equation for ip. Al is a variant of the vector 
Laplacian, thus Eq. (p0|) is an elliptic equation for the vector V\i = 1,2,3. The 
variables Aij and A^ are functions of V 1 , so that Eqs. (|59| ) and 
solved simultaneously. The functions R, K and M y are given. 



have to be 
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The computational domain again has two spheres S± t 2 excised, 

V = R 3 - Sx - S 2 - (61) 

The radii of the excised spheres are r\ = r 2 = 2 and the separation between the 
centers is 10. 

We have Dirichlct boundary conditions on all boundaries: 

rl> = l, (62) 
U' = 0, i= 1,2,3. (63) 

We solve the problem again at several different resolutions. On the coarsest 
level two Newton-Raphson iterations are necessary, whereas the finer levels need 
only one Newton-Raphson iteration. The linear solver needs 30 iterations to reduce 
the residual by 10 5 . Details about constructing Afd for the nonflat differential 
operators V 2 and A l are given in appendix |b| 

The convergence of the solutions is shown in figure || We again find smooth 
exponential convergence. Recall that the plotted quantities essentially give the error 
of the next lower resolution. Hence the next-to-highest resolution run with a total 
of 78 3 « 500000 collocation points has a maximum pointwise error of ~ 0.5 • 10~ 7 . 
The wall clock time for that run is less than 2 hours on four RS 6000 processors. 

This problem has also been attacked with a finite difference code^O). The finite 
difference code required a runtime of 200 CPU hours (on 16 nodes of an Origin 
2000). The accuracy of the finite difference code seems to be comparable to the 
lowest resolution solve of our spectral solver, which took 16 minutes CPU time. 
Compared to the finite difference code the spectral code is almost embarrassingly 
fast. 
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5. IMPROVEMENTS 



The fact that spherical harmonics have fewer spectral coefficients than colloca- 
tion points causes a host of complications. We have to solve for mixed real-spectral 
coefficients, Eq. ([l9]). This complicates the operator S, and severely complicates 
real space finite difference preconditioning. A double Fourier series [^} for the an- 
gular variables might be superior to the expansion in spherical harmonics, since 
this avoids the necessity for mixed real-spectral coefficients. Moreover one can then 
use fast transforms for both <p and which might improve runtime in the spherical 
shells. 

We are working on cylinders as a third possible subdomain type. We also hope to 
use more general mappings that are no longer restricted to acting on each dimension 
separately. 

In terms of pure runtime, one should try to optimize the interpolation to bound- 
ary points of overlapping subdomains. This is the part of the code that has the 
worst scaling with the number of unknowns. Replacing the explicit summation of 
the series by one of the faster methods discussed in || should speed up the code 
tremendously. As was seen in the discussion of parallelization in section 4.2.6, the 
code seems to be communication limited on a PC cluster. One should therefore 
also optimize the MPI calls. For example, one could overlap communication with 
subdomain internal computations. 

Even without all these improvements our code is already fairly fast. This indi- 
cates the potential of our approach. 



6. CONCLUSION 



We have presented a new elliptic solver based on pseudo-spectral collocation. 
The solver uses domain decomposition with spherical shells and rectangular blocks, 
and can handle nonlinear coupled partial differential equations. 

We introduced a new method to combine the differential operator, the boundary 
conditions and matching between subdomains in one operator S. The equation 
iSu = is then solved with Newton- Raphson and an iterative linear solver. We 
show than one can employ standard software packages for nonlinear and linear 
solves and for preconditioning. 

The operator S has the added benefit that it is modular. Therefore adaption 
of the method to a new PDE or to new boundary conditions is easy; the user has 
only to code the differential operator and/or the new boundary conditions. We also 
discuss our treatment of mappings which decouples mappings from the actual code 
evaluating the differential operator, and from the code dealing with basis functions 
and details of spectral transforms. This modularity again simplifies extension of 
the existing code with e.g. new mappings. 

We demonstrated the capabilities of the new method with three examples on 
non-simply-connected computational domains in two and three dimensions and with 
one and four variables. We also demonstrated that the domain decomposition 
allows for vastly different length scales in the problem. During the examples we 
discussed various practical details like preconditioning and parallelization. Two 
of these examples were real applications from numerical relativity. We found the 
spectral code at the coarsest resolution to be as accurate as finite difference methods, 
but faster by one to two orders of magnitude. 
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APPENDIX A: PRECONDITIONING OF INVERSE MAPPINGS 



In a subdomain with inverse mapping that extends out to (almost) infinity, 
the outermost grid points are distributed very unevenly in physical space. This 
causes finitc-diffcrcncc approximations of derivatives to fail if they are based on the 
physical coordinate positions. Therefore we difference in the collocation coordinate 
X and apply the mapping via Eq. (|25|). At the collocation grid point Xi with grid 
spacing h- = Xi — and h + — Xi + ± — Xi we thus use 



du \ h + u,i_i (h + — h_)ui h-i 



dXJ i h-(h-+h+) h+h- h + (h- 



i+l 



(64) 
(65) 



\dX 2 ).~h-(h-+h+) h_h+ ' h+(h- + h+) 



If one substitutes Eqs. ( |64| ) and ( |65| ) into (|66|), then the coefficients of and 
u i+ i are the values that have to be entered into the FD-approximation matrix Afd- 
Even with this trick, preconditioning of the radial derivatives in an extremely 
stretched outer sphere is not yet sufficiently good. The preconditioned Jacobian 
BJ still contains eigenvalues of size ~ 40. The eigenmodes are associated with 
the highest radial mode in the outer sphere. We found that we can suppress these 
eigenvalues by damping this highest radial mode by a factor of 10 after the PETSc 
preconditioning is applied. 

APPENDIX B: PRECONDITIONING THE NONFLAT LAPLACIAN 

In a nonflat manifold, the Laplace operator of Eq. (|59|) contains second and first 
derivatives of ip, 

The coefficients g lJ and f l are known functions of position. Since our particular 
manifold is almost flat, we have g %% w 1, and g 1 ^ w for i ^ j. We base our 
preconditioning only on the diagonal part of (p37 



E'ViJ- (68) 

In rectangular blocks, Eq. ( |68| ) can be preconditioned without additional fill-in 
in Afd- Inclusion of the mixed second derivatives from Eq. ([37]) in Afd leads to 
a large fill-in of Afd- The increase in computation time due to the larger fill-in 
outweighs the improvement of convergence of the iterative solver in our problems. 

For the spherical shells, matters are complicated by the fact that we use mixed 



real-space/spectral space coefficients [recall Eq. (49)]. It is easy to precondition 
the angular piece of the flat space Laplacian, since our basis functions Y\ m are the 
eigenfunctions of this operator. Derivative operators with angle-dependent coeffi- 
cients lead to convolutions in spectral space and lead thus to a large fill-in in the 
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preconditioning matrix. Therefore we can only precondition radial derivatives with 
coefficients that are independent of the angles 9, (f>. We thus need to approximate 
Eq. ( |68| ) by a flat space angular Laplacian and constant coefficient radial derivatives. 
We proceed as follows. 

Rewrite Eq. ( |68| ) in spherical coordinates, 

3 f) 2 rfi r) 2 f) 2 Ft" 

~[ dx l <J" dty or 1 



G 8r ° 2 + ^ + F e — + F* — + F r — . (69) 
dOdr d6dr 89 dd> dr 



At each grid point, the various functions G and F can be computed via standard 
Cartesian to polar coordinate transformations. 

For each radial grid point n, average over the angles to obtain G® e , G[ r and F[. 
Now precondition as if G\ B were part of an angular piece of the flat space Laplacian, 
i.e. enter — 1(1 + l)Gf e /rf as the diagonal element belonging to the unknown un m . 
Further, precondition G^d 2 /dr 2 + F[d/dr with finite differences as described in 
appendix |X| Ignore all other terms in Eq. (p9|). 

The operator in Eq. ( |60| ) is defined by 

1 3 3 
A L V l = V 2 V l + i^V*V fe V & + ^2K l k V k , (70) 

k=l k=l 

V and Rij being the covariant derivative operator and Ricci tensor associated with 
the metric of the manifold. A^V 1 contains thus the nonflat Laplace operator acting 
on each component V 1 separately, plus terms coupling the different components 
which involve second derivatives, too. We precondition only the Laplace operator 
V 2 V l for each component V 1 separately as described above and ignore the coupling 
terms between different components. 
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